Loan Data Analysis by Andrew Voorhees

{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
echo=FALSE,warning=FALSE, message=FALSE)

This report explores a dataset containing over 100,000 loans from the lending company Prosper.

Univariate Plots Section

## [1] 113937     81

Although there are 81 fields, I want to concern myself with only a limited number of these for my analysis. The fields I chose to focus on were:

  1. CreditGrade
  2. ListingCreationDate
  3. Term
  4. LoanStatus
  5. BorrowerAPR
  6. Prosper Rating
  7. ProsperScore
  8. ListingCategory
  9. BorrowerState
  10. CreditRangeLower / CreditRangeUpper
  11. StatedMonthlyIncome
  12. LoanOriginalAmount
  13. Investors
keep = c('CreditGrade', 
         'ListingCreationDate', 
         'Term',
         'BorrowerAPR',
         'ProsperRating..numeric.', 
         'ProsperScore',
         'ListingCategory..numeric.',
         'CreditScoreRangeLower',
         'CreditScoreRangeUpper',
         'StatedMonthlyIncome',
         'LoanOriginalAmount',
         'Investors')
data = data[keep]

str(data)
## 'data.frame':    113937 obs. of  12 variables:
##  $ CreditGrade              : Factor w/ 8 levels "A","AA","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
##  $ ListingCreationDate      : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ProsperRating..numeric.  : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperScore             : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ Investors                : int  258 1 41 158 20 1 1 1 1 1 ...

Next I want to do some basic data wrangling to organize my data.

# Want Credit Grade to be an ordered Factor
data$CreditGrade = factor(data$CreditGrade, 
                          ordered=T, 
                          levels=c("AA", "A", "B", "C",
                                   "D",  "E",  "HR", "NC"))
# Creation date
data$ListingCreationDate = as.Date(data$ListingCreationDate)
data$CreationMonth = as.Date(cut(data$ListingCreationDate, breaks='month'))

p_levels = unique(sort(format(as.Date(data$ListingCreationDate), '%m')))
data$NumericMonth = factor(format(as.Date(data$ListingCreationDate), 
                                          '%m'), 
                                   ordered=TRUE, 
                                   levels=p_levels)
# Term is probably more accurately reflected as an ordered factor as well
data$Term = factor(data$Term, ordered=TRUE, levels=c(12, 36, 60))
#Renaming ProsperRating and setting as a factor
names(data)[names(data) == 'ProsperRating..numeric.'] = 'ProsperRating'
data$ProsperRating = factor(data$ProsperRating, ordered=TRUE, levels=seq(1, 7, 1))
# Ordered factor for Prosper Score
data$ProsperScore = factor(data$ProsperScore, ordered=TRUE, levels=seq(1, 11, 1))
# Rename listing category and map to actual concepts
names(data)[names(data) == 'ListingCategory..numeric.'] = 'ListingCategory'
p_from = seq(0, 20, 1)
# this is frustrating, but I'm not sure there's a clearer way
p_to = c('Not Available', 
         'Debt Consolidation',
         'Home Improvement', 
         'Business',
         'Personal Loan', 
         'Student Use',
         'Auto',
         'Other',
         'Baby&Adoption',
         'Boat', 
         'Cosmetic Procedure', 
         'Engagement Ring', 
         'Green Loans',
         'Household Expenses',
         'Large Purchases',
         'Medical/Dental', 
         'Motorcycle',
         'RV', 
         'Taxes',
         'Vacation',
         'Wedding Loans')

data$ListingCategory = factor(mapvalues(data$ListingCategory, 
                                        from = p_from, 
                                        to = p_to), 
                              levels=p_to)
data$CreditScore = (data$CreditScoreRangeUpper + data$CreditScoreRangeLower) / 2
data$CreditScoreRangeUpper = NULL 
data$CreditScoreRangeLower = NULL 

Great, so what does our data look like now?

str(data)
## 'data.frame':    113937 obs. of  13 variables:
##  $ CreditGrade        : Ord.factor w/ 8 levels "AA"<"A"<"B"<"C"<..: 4 NA 7 NA NA NA NA NA NA NA ...
##  $ ListingCreationDate: Date, format: "2007-08-26" "2014-02-27" ...
##  $ Term               : Ord.factor w/ 3 levels "12"<"36"<"60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ BorrowerAPR        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ProsperRating      : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperScore       : Ord.factor w/ 11 levels "1"<"2"<"3"<"4"<..: NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory    : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ StatedMonthlyIncome: num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ Investors          : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ CreationMonth      : Date, format: "2007-08-01" "2014-02-01" ...
##  $ NumericMonth       : Ord.factor w/ 12 levels "01"<"02"<"03"<..: 8 2 1 10 9 12 4 5 12 12 ...
##  $ CreditScore        : num  650 690 490 810 690 ...
summary(data)
##   CreditGrade    ListingCreationDate  Term        BorrowerAPR     
##  C      : 5649   Min.   :2005-11-09   12: 1614   Min.   :0.00653  
##  D      : 5153   1st Qu.:2008-09-19   36:87778   1st Qu.:0.15629  
##  B      : 4389   Median :2012-06-16   60:24545   Median :0.20976  
##  AA     : 3509   Mean   :2011-07-08              Mean   :0.21883  
##  HR     : 3508   3rd Qu.:2013-09-09              3rd Qu.:0.28381  
##  (Other): 6745   Max.   :2014-03-10              Max.   :0.51229  
##  NA's   :84984                                   NA's   :25       
##  ProsperRating    ProsperScore             ListingCategory 
##  4      :18345   4      :12595   Debt Consolidation:58308  
##  5      :15581   6      :12278   Not Available     :16965  
##  6      :14551   8      :12053   Other             :10494  
##  3      :14274   7      :10597   Home Improvement  : 7433  
##  2      : 9795   5      : 9813   Business          : 7189  
##  (Other):12307   (Other):27517   Auto              : 2572  
##  NA's   :29084   NA's   :29084   (Other)           :10976  
##  StatedMonthlyIncome LoanOriginalAmount   Investors      
##  Min.   :      0     Min.   : 1000      Min.   :   1.00  
##  1st Qu.:   3200     1st Qu.: 4000      1st Qu.:   2.00  
##  Median :   4667     Median : 6500      Median :  44.00  
##  Mean   :   5608     Mean   : 8337      Mean   :  80.48  
##  3rd Qu.:   6825     3rd Qu.:12000      3rd Qu.: 115.00  
##  Max.   :1750003     Max.   :35000      Max.   :1189.00  
##                                                          
##  CreationMonth         NumericMonth    CreditScore   
##  Min.   :2005-11-01   01     :11214   Min.   :  9.5  
##  1st Qu.:2008-09-01   10     :10539   1st Qu.:669.5  
##  Median :2012-06-01   12     :10320   Median :689.5  
##  Mean   :2011-06-24   02     :10124   Mean   :695.1  
##  3rd Qu.:2013-09-01   09     :10074   3rd Qu.:729.5  
##  Max.   :2014-03-01   11     : 9952   Max.   :889.5  
##                       (Other):51714   NA's   :591

Univariate Analysis Plots

ggplot(aes(x=Investors), data=data) + 
  geom_histogram(boundary=0)  
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Intuitively, having a number of investors already paired with a loan might encourage further investment. This indicates that the number of investors might follow a multiplicative process. I’ll take a log transformation to better grasp the distribution.

ggplot(aes(x=Investors), data=data) + 
  geom_histogram(boundary=0) + 
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interesting, so if you throw out the common case where there are zero investors, the distribution appears roughly normal or skewed normal. Definitely want to look at that peak on the far left though

sort(table(data$Investors), decreasing=TRUE)[1:5]
## 
##     1     2     3     4     5 
## 27814  1386   991   827   753

The peak at the beginning is one of the issues with taking logs of counts. When the # of investors is 1, the log of that value is 0 which screws with my distribution quite a bit.

Next I want to look into what credit grade is assigned to most of the listings.

ggplot(aes(x=CreditGrade), data=subset(data, !is.na(CreditGrade))) + 
  geom_bar()

It would appear that most listings fall int he middle with a ‘C’ grade, although there are a significant number in all grades.

When was Prosper’s platform most popular? Is there a time when most of the listings were created?

ggplot(aes(x=CreationMonth, group=1), data=data) + 
  geom_point(stat='count') + 
  geom_line(stat='count') + 
  scale_x_date(labels = date_format('%Y-%m'),
               date_breaks = '6 month',
               limits = c(as.Date('2005-6-1'), as.Date('2014-12-1'))) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

There appears a sharp drop off in early to mid 2008 indicating an impact from the recession. There is also a sharp drop off in early 2014, and it might be useful to read up on the company and examine the possible cause. I don’t see a clear monthly pattern, but I’ll check the distribution just in case.

ggplot(aes(x=NumericMonth), data=data) +
  geom_bar()

Next we’ll check the frequency of different term lengths for the loans.

ggplot(aes(x=Term), data=data) + geom_bar()

It looks like the vast majority of loans are 36 month (3 year) loans, with very few being a 12 month loan. It might be interesting later to look at the relationship between loan term and the amount.

ggplot(aes(x=BorrowerAPR), data=data) + geom_histogram(boundary=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 25 rows containing non-finite values (stat_bin).

summary(data$BorrowerAPR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

The APR distribution appears to have a couple of distinct peaks with a few clear outliers at the high and low end.

ggplot(aes(x=ProsperRating), data=subset(data, !is.na(ProsperRating))) +
         geom_bar()

ggplot(aes(x=ProsperScore), data=subset(data, !is.na(ProsperScore))) +
         geom_bar()

summary(data$ListingCategory)
##      Not Available Debt Consolidation   Home Improvement 
##              16965              58308               7433 
##           Business      Personal Loan        Student Use 
##               7189               2395                756 
##               Auto              Other      Baby&Adoption 
##               2572              10494                199 
##               Boat Cosmetic Procedure    Engagement Ring 
##                 85                 91                217 
##        Green Loans Household Expenses    Large Purchases 
##                 59               1996                876 
##     Medical/Dental         Motorcycle                 RV 
##               1522                304                 52 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

I want to look only at the most popular listing categories

sub_data = subset(data, !is.na(ListingCategory) &
                     ListingCategory != 'Other' &
                     ListingCategory != 'Not Available')

categories = names(sort(table(sub_data$ListingCategory), decreasing = T)[1:5])

ggplot(aes(x=ListingCategory), 
       data=subset(data, ListingCategory %in% categories)) + 
  geom_bar()

Clearly Debt Consolidation is by far the most popular use for Prosper.

summary(data$StatedMonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

I would like to be the guy who makes 175K a month

ggplot(aes(x=StatedMonthlyIncome), data=data) + 
  geom_histogram(boundary=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Hmm, I going to go ahead an ignore the top 1% of values because that is cleary skewing the distribution.

ggplot(aes(x=StatedMonthlyIncome), data=data) + 
  geom_histogram(boundary=0) + 
  xlim(0, quantile(data$StatedMonthlyIncome, .99))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1140 rows containing non-finite values (stat_bin).

It looks like either a log normal or exponential distribution. I’ll take a log transformation to try and get a better idea of the data

ggplot(aes(x=StatedMonthlyIncome), data=data) + 
  geom_histogram(binwidth = .1) + 
  xlim(0, quantile(data$StatedMonthlyIncome, .99)) + 
  scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1394 rows containing non-finite values (stat_bin).

Now we’re looking roughly normal with a few clear outliers and a low standard deviation

I want to look into the high and low outliers that to see if there is anything weird going on.

head(data[order(data$StatedMonthlyIncome),])
##     CreditGrade ListingCreationDate Term BorrowerAPR ProsperRating
## 79            A          2008-06-29   36     0.18454          <NA>
## 101        <NA>          2013-01-25   36     0.35356             1
## 109           C          2008-04-29   36     0.28320          <NA>
## 126        <NA>          2011-05-10   36     0.35643             2
## 171        <NA>          2012-04-14   36     0.28851             3
## 187        <NA>          2012-02-03   36     0.35797             1
##     ProsperScore    ListingCategory StatedMonthlyIncome LoanOriginalAmount
## 79          <NA>           Business                   0               4800
## 101            3 Household Expenses                   0               4000
## 109         <NA> Debt Consolidation                   0              14000
## 126            5              Other                   0               5000
## 171            5              Other                   0              10000
## 187            5   Home Improvement                   0               4000
##     Investors CreationMonth NumericMonth CreditScore
## 79        115    2008-06-01           06       729.5
## 101        73    2013-01-01           01       729.5
## 109        53    2008-04-01           04       649.5
## 126        92    2011-05-01           05       649.5
## 171       161    2012-04-01           04       769.5
## 187        45    2012-02-01           02       669.5

A large number of people list their monthly income as 0, which screws with the log distribution. After examining the orignal dataframe, it’s difficult to tell whether this data is a mistake or not, so we’ll just take it at face value.

tail(data[order(data$StatedMonthlyIncome),])
##        CreditGrade ListingCreationDate Term BorrowerAPR ProsperRating
## 113271        <NA>          2013-02-20   36     0.22712             4
## 8067          <NA>          2011-04-02   36     0.35643             1
## 57134         <NA>          2011-03-24   36     0.35643             2
## 17412         <NA>          2011-04-08   36     0.35643             2
## 53015         <NA>          2013-01-04   12     0.25785             3
## 53168         <NA>          2012-02-20   36     0.35797             1
##        ProsperScore    ListingCategory StatedMonthlyIncome
## 113271            6              Other            394400.0
## 8067              4 Debt Consolidation            416666.7
## 57134             5   Home Improvement            466666.7
## 17412             5 Debt Consolidation            483333.3
## 53015             3           Business            618547.8
## 53168             4           Business           1750002.9
##        LoanOriginalAmount Investors CreationMonth NumericMonth CreditScore
## 113271               2000        26    2013-02-01           02       689.5
## 8067                 2000        10    2011-04-01           04       669.5
## 57134                4000        72    2011-03-01           03       689.5
## 17412                7500       128    2011-04-01           04       669.5
## 53015                4000        93    2013-01-01           01       729.5
## 53168                4000         2    2012-02-01           02       749.5
ggplot(aes(x=LoanOriginalAmount), data=data) + 
  geom_histogram(binwidth = 3000, boundary=0)

Not a clear distribution for loan amount, but it appears as though it could be normal. I’ll look at the distribution of the log value just to see how it looks.

ggplot(aes(x=LoanOriginalAmount), data=data) + 
  geom_histogram(binwidth = .1) + 
  scale_x_log10()

Original loan amount doesn’t really fit very neatly into a distribution.

CreditScore:

ggplot(aes(x=CreditScore), data=data) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 591 rows containing non-finite values (stat_bin).

Distribution appears roughly normal. It seems safe to assume that when the credit score is 0, that’s a case where the data is simply incomplete.

Univariate Analysis

What is the structure of your dataset?

There are 113,937 observations with 81 different features, however I chose to for this analysis I made some basic transformations and chose to focus on 12 of them.

  • CreditGrade
  • ListingCreationDate
  • CreationMonth
  • Term
  • BorrowerAPR
  • ProsperRating
  • ProsperScore
  • ListingCategory
  • CreditScore
  • StatedMonthlyIncome
  • LoanOriginalAmount
  • Investors

I chose to treat the variables Term, ProsperRating, and ProsperScore as ordered factors. The ordering is obvious as they are all whole numbers from 1 to the max level for the factor. ListingCategory is an unordered factor with 21 levels.

Other observations:

  • Most loans have a small number of investors with 0 being most common.
  • Credit grades are pretty evenly distributed, although B,C, and D are more typical.
  • There was a drop off in listings with the recession and in general the service has become more popular over time.
  • There is no pattern for creation month.
  • The vast majority of loans have a 36 month term.
  • BorrowerAPR is normally distributed.
  • By far the most common listing category is “Debt Consolidation”

What is/are the main feature(s) of interest in your dataset?

The main features of interest are BorrowerAPR and factors that intuitively correlate with risk, with the core one being CreditScore.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think that ProsperRating, ProsperScore, and CreditGrade are all important factors in determining BorrowerAPR. I cannot say with certainty which will be most influential.

Did you create any new variables from existing variables in the dataset?

I averaged the high credit score and the low credit score in order to create a single numeric credit score. I also found the month for each listing creation date to look for any patterns.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

A number of the features were clearly exponential distributions. When analyzing relationships, it is generally more informative when a variable is normally distributed. For these variables I looked at the log of the distribution. Eg:

  • Investors
  • StatedMonthlyIncome (Also want to exclude very low values)
  • LoanOriginalAmount

Bivariate Plots Section

This is pretty noisy so I’ll look into relationships with individual plots.

First I want to look more closely at the impact of CreditGrade. For these plots I’ll exclude ‘NA’ and ‘NC’ grades, considering them incomplete data.

credit_grade_idx = (data$CreditGrade != 'NC') & (!is.na(data$CreditGrade))
ggplot(aes(x=CreditGrade, y=BorrowerAPR), data=data[credit_grade_idx, ]) + 
  geom_boxplot()
## Warning: Removed 24 rows containing non-finite values (stat_boxplot).

As expected, loans with better credit grades resulted in a lower rate for the borrower. However, for all cases there are borrowers that represent significant outliers. I will look into these outlying observations later in the analysis.

I also want to see if there is a relationship between the Term of the loan and the rate borrowers pay. Intuitively you might expect borrowers with longer loan periods to pay a lower rate.

ggplot(aes(x=Term, y=BorrowerAPR), data=data) + 
  geom_boxplot()
## Warning: Removed 25 rows containing non-finite values (stat_boxplot).

It’s interesting we see almost no difference in APR between different term periods. However, it is worth noting that the 60 month loans have a much lower spread than the shorter loans. For concrete numbers here are the standard deviations:

apr_idx = !is.na(data$BorrowerAPR)
by(data[apr_idx,]$BorrowerAPR, data[apr_idx,]$Term, sd)
## data[apr_idx, ]$Term: 12
## [1] 0.09023572
## -------------------------------------------------------- 
## data[apr_idx, ]$Term: 36
## [1] 0.08548865
## -------------------------------------------------------- 
## data[apr_idx, ]$Term: 60
## [1] 0.05747886

And interquartile ranges:

by(data[apr_idx,]$BorrowerAPR, data[apr_idx,]$Term, IQR)
## data[apr_idx, ]$Term: 12
## [1] 0.144505
## -------------------------------------------------------- 
## data[apr_idx, ]$Term: 36
## [1] 0.14408
## -------------------------------------------------------- 
## data[apr_idx, ]$Term: 60
## [1] 0.08534

Prosper Rating:

rating_idx = !is.na(data$ProsperRating)
ggplot(aes(x=ProsperRating, y=BorrowerAPR), data=data[rating_idx,]) + 
  geom_boxplot()

As you might expect, higher ratings pay a lower APR and the differences appear pretty significant. What is surprising is the huge number of outliers, especially for the lowest rating.

ProsperScore:

score_idx = !is.na(data$ProsperScore)
ggplot(aes(x=ProsperScore, y=BorrowerAPR), data=data[score_idx, ]) + 
  geom_boxplot()

The general pattern is consistent with what you might expect, but there is quite a bit more variability than with ProsperRating. Clearly many investors have different feelings about the level of risk assigned to loans than Prosper.

ListingCategory:

listing_idx = (!is.na(data$ListingCategory) & 
                 (data$ListingCategory != 'Not Available'))
ggplot(aes(x=ListingCategory, y=BorrowerAPR), data=data[listing_idx, ]) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0))

In general, the listing category doesn’t seem to have a big impact on the apr the borrower pays. What interesting however are the two apparrent exceptions with personal loans paying a lower apr and cosmetic procedures paying a higher apr. Perhaps this says something about the biases of the lenders.

StatedMonthlyIncome:

Next we’ll look at a continuous variable.

income_idx = !is.na(data$StatedMonthlyIncome)
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR), 
       data=data[income_idx, ]) + 
  geom_point()
## Warning: Removed 25 rows containing missing values (geom_point).

Recall from the histogram of income that a few very high values skew the distribution. I’ll remove those to get a better idea of the relationship.

ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR), 
       data=data[income_idx, ]) + 
  geom_point() + 
  xlim(0, quantile(data$StatedMonthlyIncome, .99))
## Warning: Removed 1163 rows containing missing values (geom_point).

A lot of incomes and APRs clearly are set to neat values. I’ll jitter the points to help get a clearer picture as well as increase the transparency.

income_log_idx = data$StatedMonthlyIncome > 1
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR), 
       data=data[income_idx & income_log_idx, ]) + 
  geom_point(alpha=.01, position = 'jitter') + 
  xlim(0, quantile(data$StatedMonthlyIncome, .99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1163 rows containing non-finite values (stat_smooth).
## Warning: Removed 1163 rows containing missing values (geom_point).

There doesn’t appear to be any relationship at all between income and APR. If anything there is a slight negative relationship. This is surprising to me, but I suspect those with larger incomes also ask for larger loan amounts. I would also like to try a simple log transformation, as earlier that helped the income distribution appear more normal.

ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR), 
       data=data[income_idx, ]) + 
  geom_point(alpha=.01, position = 'jitter') + 
  #xlim(1, quantile(data$StatedMonthlyIncome, .99)) + 
  scale_x_log10(limits=c(quantile(data$StatedMonthlyIncome, .05), 
                         quantile(data$StatedMonthlyIncome, .99))) + 
  geom_smooth()
## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis
## `geom_smooth()` using method = 'gam'
## Warning: Removed 6844 rows containing non-finite values (stat_smooth).
## Warning: Removed 5468 rows containing missing values (geom_point).

LoanOriginalAmount:

amount_idx = !is.na(data$LoanOriginalAmount)
ggplot(aes(x=LoanOriginalAmount, y=BorrowerAPR), 
       data=data[amount_idx, ]) + 
  geom_point()
## Warning: Removed 25 rows containing missing values (geom_point).

Again high density clusters of points and distinct verticals that are common. I’ll fade and jitter the points.

ggplot(aes(x=LoanOriginalAmount, y=BorrowerAPR), 
       data=data[amount_idx, ]) + 
  #geom_point(alpha=.01, position = 'jitter_x')
  geom_jitter(alpha=.01, width=100) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).

There is a large amount of variability, but there does appear to be a slight negative relationship between loan amount and borrower APR. I suspect this is a result of wealthier borrowers taking out larger loans. We’ll check that assumption quickly.

ggplot(aes(x=LoanOriginalAmount, y=StatedMonthlyIncome), 
       data=data) + 
  geom_point(position='jitter', alpha=.01) + 
  ylim(0, quantile(data$StatedMonthlyIncome, .99)) + 
  geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1140 rows containing non-finite values (stat_smooth).
## Warning: Removed 1851 rows containing missing values (geom_point).

It’s not perfect, but there’s clearly a positionve linear relationship between income and loan amount which supports my theory.

Investors:

investor_idx = !is.na(data$Investors)
ggplot(aes(x=Investors, y=BorrowerAPR), data=data[investor_idx, ]) + 
  geom_point()
## Warning: Removed 25 rows containing missing values (geom_point).

In the univariate analysis it was clear that Investors follows an exponential relationship so I’ll take a log transformation. I also want to exclude cases with 0 investors.

ggplot(aes(x=Investors, y=BorrowerAPR), data=data[investor_idx, ]) + 
  geom_point(alpha=.05, position='jitter') + 
  scale_x_log10(limits=c(1, max(data$Investors)))
## Warning: Removed 13865 rows containing missing values (geom_point).

There isn’t really a clear picture about the relationship between the number of investors and the APR.

CreditScore: (Assuming anything under 300 is a mistake)

credit_score_idx = !is.na(data$CreditScore) & (data$CreditScore >= 300)
ggplot(aes(x=CreditScore, y=BorrowerAPR), data=data[credit_score_idx, ]) + 
  geom_point(position = 'jitter', alpha=.01) + 
  geom_smooth(method='lm')

As expected, there appears to be a negative linear relationship between credit score and APR.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Unsurprisingly, the different grading systems (CreditGrade, ProsperRating, ProsperScore) were all strong predictors of BorrowerAPR, with ProsperScore being the least reliable.

The length of the loan (Term) does not appear to impact the BorrowerAPR.

ListingCategory seems not to matter much with the notable exceptions of ‘Cosmetic Procedure’ and ‘Personal Loan’.

After removing outliers, there appears to be a weak negative linear relationship between StatedMonthlyIncome and BorrowerAPR.

There is also a weak negative linear relationship between LoanOriginalAmount and BorrowerAPR.

Also there is a clear negative lineral relationship between CreditScore and BorrowerAPR.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

There is a strong positive relationship between StatedMonthlyIncome and LoanOriginalAmount so I suspect having both in the model would be a solid predictor of BorrowerAPR.

What was the strongest relationship you found?

The relationship between CreditScore and BorrowerAPR. Intuitively, this makes a lot of sense.

Multivariate Plots Section

I want to explore the obvious outliers in my BorrowerAPR vs CreditGrade boxplots. Why are some borrowers with good credit grades receiving high APRs and vice versa.

Including CreditScore doesn’t help us explain the CreditGrade APR outliers. It is clear from the clean color bands in the plot that CreditScore is almost entirely determinite of CreditGrade with the bands becoming a little fuzzier for the lower credit scores.

sub_data = data[income_idx & income_log_idx & credit_grade_idx,]
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR), data=sub_data) + 
  geom_point(position='jitter', alpha=1, aes(color=CreditGrade)) + 
  xlim(0, quantile(sub_data$StatedMonthlyIncome, .99)) + 
  scale_color_brewer(type='seq')
## Warning: Removed 303 rows containing missing values (geom_point).

There actually doesn’t appear to be a strong relationship between income and borrower APR at all. It might be more interesting to look at loan amount relative to income.

sub_data = data[income_idx & income_log_idx & credit_grade_idx,]
ggplot(aes(x=LoanOriginalAmount/StatedMonthlyIncome, y=BorrowerAPR), 
       data=sub_data) + 
  geom_point(position='jitter', alpha=1, aes(color=CreditGrade)) + 
  xlim(0, 
       quantile(sub_data$LoanOriginalAmount/sub_data$StatedMonthlyIncome, .99))+
  scale_color_brewer(type='seq')
## Warning: Removed 303 rows containing missing values (geom_point).

wh, so this is a little more informative. There appears to be a positive linear relationship between this ratio and and APR. Let’s limit ourselves ot just high credit grades.

idx = data$CreditGrade %in% c('AA', 'A')
sub_data = data[income_idx & income_log_idx & credit_grade_idx & idx,]
ggplot(aes(x=LoanOriginalAmount/StatedMonthlyIncome, y=BorrowerAPR), 
       data=sub_data) + 
  geom_point(position='jitter', alpha=.3, aes(color=CreditGrade)) + 
  xlim(0, 
       quantile(sub_data$LoanOriginalAmount/sub_data$StatedMonthlyIncome, .99)) +
  geom_smooth(method='lm')
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).

It looks like the ratio of the loan to income helps to explain at least part of the reason borrowers with a high credit grade might be given a higher APR. This is especially true for high credit grade borrowers.

Let’s also see if time when the loan is given has an impact on APR.

ggplot(aes(x=CreationMonth, y=BorrowerAPR), 
       data=data[credit_grade_idx,]) + 
  geom_point(position='jitter', alpha=1, aes(color=CreditGrade)) + 
  scale_color_brewer(type='seq')
## Warning: Removed 24 rows containing missing values (geom_point).

There’s definitely a weird cap on higher APR rates for low credit grade borrowers during the midst of the recession.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

It appears that a borrower’s credit grade is almost entirely a function of their averaged credit score. However, this doesn’t entirely explain some of the weird outliers. One possible explantion appeared to be the loan to income ratio.

Were there any interesting or surprising interactions between features?

It was surprising how perfectly credit grade maps to credit score. I would have thought Prosper used a more sophisticated mdoel.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I used a basic linear regression to model the relationship between APR and the loan amount to income ratio for high credit grade borrowers.


Final Plots and Summary

Plot One

Description One

Although the listing category may not be the most informative feature for most loans, I though it was interesting because of the outliers. It’s clear that the lending market is not kind to those taking out loans for “cosmetic procedures”, but there is often times a discount for something as generic as “personal use”.

Plot Two

Description Two

Unsurprisingly one of the best predictors of a borrower’s APR is their credit score. It just demonstrates that even on more modern lending platforms, it’s still important to carefully manage your credit.

Plot Three

## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 82 rows containing missing values (geom_point).

Description Three

This plot shows the impact of a borrower’s loan amount to income ratio on APR. It demonstrates that people who borrow a lot realtive to their monthly income can expect to pay a higher APR.


Reflection

The Prosper dataset contains over 100,000 loans with 81 different features for each loan. In my project, I chose to focus on only a small subset of those features. First I plotted the distribution of each feature individually in order to get a feel for my data and look for potential outliers or oddities.

Next I looked into what factors determined the APR for an individual borrower. Unsurprisingly, I found the Prosper rating systems such as Credit Grade, Prosper Rating, and Prosper Score were all strong indicators of what APR a borrower might pay. Additionally, credit score was strongly correlated with APR. However, there were numerous outliers and I wanted to examine how these occurred. I found that borrowers with high credit grades who still borrowed a large amount relative to their monthly incomes were likely to pay a higher APR than you might otherwise expect.

There are obvious limitation with the conclusions I draw. First, I only looked at a very small subset of the total number of features. There could be much stronger correlations and patterns that I’m not seeing. Second, my final plots only looked at Credit Grade, which only exists up till 2009, so I’m looking at only a subset of my data. Given that there are so many variables that can affect APR, in the future I might look into outside economic indicators such as Fed interest rates to see if those cause some of the noise in my data.